98 research outputs found

    VARD2: a tool for dealing with spelling variation in historical corpora

    When applying corpus linguistic techniques to historical corpora, the corpus researcher should be cautious about the results obtained. Corpus annotation techniques such as part-of-speech tagging, trained for modern languages, are particularly vulnerable to inaccuracy due to vocabulary and grammatical shifts in language over time. Basic corpus retrieval techniques such as frequency profiling and concordancing will also be affected, in addition to the more sophisticated techniques such as keywords, n-grams, clusters and lexical bundles, which rely on word frequencies for their calculations. In this paper, we highlight these problems with particular focus on Early Modern English corpora. We also present an overview of the VARD tool, our proposed solution to this problem, which facilitates pre-processing of historical corpus data by inserting modern equivalents alongside historical spelling variants. Recent improvements to the VARD tool include the incorporation of techniques used in modern spell-checking software.
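
    As a rough illustration of the pre-processing described above, the sketch below pairs historical variants with modern equivalents using a small known-variant list plus fuzzy matching against a modern wordlist. It is a minimal Python sketch of the general idea, not the VARD implementation, and the wordlist and variant list are invented stand-ins.

```python
# A minimal sketch of dictionary-plus-fuzzy-matching normalisation; this is
# an illustration of the general idea, not the VARD implementation. The
# wordlist and known-variant list below are tiny illustrative stand-ins.
from difflib import get_close_matches

MODERN_WORDLIST = {"love", "only", "virtue", "the", "never", "such"}
KNOWN_VARIANTS = {"loue": "love", "onely": "only", "vertue": "virtue"}

def normalise(token: str) -> str:
    """Return a candidate modern equivalent for a token, or the token itself."""
    lower = token.lower()
    if lower in MODERN_WORDLIST:
        return token                      # already a modern form
    if lower in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[lower]      # known historical variant
    matches = get_close_matches(lower, MODERN_WORDLIST, n=1, cutoff=0.75)
    return matches[0] if matches else token

def annotate(text: str) -> str:
    """Insert the modern equivalent alongside each detected spelling variant."""
    parts = []
    for token in text.split():
        modern = normalise(token)
        parts.append(token if modern.lower() == token.lower() else f"{token}[{modern}]")
    return " ".join(parts)

print(annotate("such loue and vertue were neuer seene"))
# -> such loue[love] and vertue[virtue] were neuer[never] seene
```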

    Dealing with spelling variation in Early Modern English texts

    Early English Books Online contains facsimiles of virtually every English work printed between 1473 and 1700: some 125,000 publications. In September 2009, the Text Creation Partnership released the second instalment of transcriptions of the EEBO collection, bringing the total number of transcribed works to 25,000. It has been estimated that this transcribed portion contains 1 billion words of running text. With such large datasets and the increasing variety of historical corpora available from the Early Modern English period, the opportunities for historical corpus linguistic research have never been greater. However, it has been observed in prior research, and quantified on a large scale for the first time in this thesis, that texts from this period contain significant amounts of spelling variation until the eventual standardisation of orthography in the 18th century. The problems caused by this historical spelling variation are the focus of this thesis. It will be shown that the high levels of spelling variation found have a significant impact on the accuracy of two widely used automatic corpus linguistic methods: Part-of-Speech annotation and key word analysis. The development of historical spelling normalisation methods which can alleviate these issues will then be presented. Methods will be based on techniques used in modern spellchecking, with various analyses of Early Modern English spelling variation dictating how the techniques are applied. With the methods combined into a single procedure, automatic normalisation can be performed on an entire corpus of any size. Evaluation of the normalisation performance shows that, after training, 62% of required normalisations are made, with a precision rate of 95%.
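
    A simplified reading of the evaluation figures quoted above (62% of required normalisations made, 95% precision): the sketch below shows how such precision and recall values can be computed from gold-standard and system normalisation pairs. The token pairs are illustrative only, not the thesis's evaluation data, and the thesis's own protocol is more detailed.

```python
# A sketch of the evaluation summarised above: recall is the proportion of
# required normalisations that were made correctly, and precision is the
# proportion of the normalisations made that are correct. The token pairs
# below are illustrative only, not the thesis's evaluation data.

def precision_recall(gold: dict, system: dict) -> tuple:
    """gold maps tokens needing normalisation to their required modern forms;
    system maps tokens to the modern forms the normaliser proposed."""
    correct = sum(1 for tok, modern in system.items() if gold.get(tok) == modern)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold = {"loue": "love", "onely": "only", "vertue": "virtue", "doe": "do"}
system = {"loue": "love", "onely": "onlie", "doe": "do"}   # one wrong, one missed

precision, recall = precision_recall(gold, system)
print(f"precision={precision:.2f} recall={recall:.2f}")    # precision=0.67 recall=0.50
```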

    Guidelines for normalising Early Modern English corpora: decisions and justifications

    Corpora of Early Modern English have been collected and released for research for a number of years. With large-scale digitisation activities gathering pace in the last decade, much more historical textual data is now available for research on numerous topics, including historical linguistics and conceptual history. We summarise previous research which has shown that it is necessary to map historical spelling variants to modern equivalents in order to successfully apply natural language processing and corpus linguistics methods. Manual and semi-automatic methods have been devised to support this normalisation and standardisation process. We argue that it is important to develop a linguistically meaningful rationale to achieve good results from this process. In order to do so, we propose a number of guidelines for normalising corpora and show how these guidelines have been applied in the Corpus of English Dialogues.
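
    The guidelines themselves are not reproduced in this abstract; the sketch below only illustrates, with invented rules, what applying explicit guideline-driven decisions can look like in practice: normalising pure spelling variants while deliberately preserving archaic word forms.

```python
# Illustrative only: the rules below are invented stand-ins, not the
# guidelines proposed in the paper. The point is that normalisation decisions
# are made explicit and applied consistently, e.g. mapping spelling-only
# variants to modern forms while preserving archaic word forms.

NORMALISE = {"loue": "love", "neuer": "never", "bee": "be"}   # spelling-only variants
PRESERVE = {"thou", "hath", "doth"}                           # archaic forms left as-is

def apply_guidelines(tokens):
    for token in tokens:
        if token in PRESERVE:
            yield token                        # documented decision: keep unchanged
        else:
            yield NORMALISE.get(token, token)  # change only where a rule applies

print(" ".join(apply_guidelines("thou hath neuer knowne such loue".split())))
# -> thou hath never knowne such love
```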

    Fool’s Errand: Looking at April Fools Hoaxes as Disinformation through the Lens of Deception and Humour

    Every year on April 1st, people play practical jokes on one another and news websites fabricate false stories with the goal of making fools of their audience. In an age of disinformation, with Facebook under fire for allowing “Fake News” to spread on their platform, every day can feel like April Fools’ Day. We create a dataset of April Fools’ hoax news articles and build a set of features based on past research examining deception, humour, and satire. Analysis of our dataset and features suggests that the structural complexity and levels of detail in a text are the most important types of feature in characterising April Fools’ hoaxes. We propose that these features are also very useful for understanding Fake News, and disinformation more widely.
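
    The paper's exact feature set is not listed in this abstract; the sketch below uses invented stand-ins to illustrate what structural-complexity and level-of-detail style features can look like when extracted from a text.

```python
# Illustrative stand-ins for structural-complexity and level-of-detail style
# features (not the paper's actual feature set): sentence statistics as a
# rough proxy for structure, and counts of numbers and capitalised tokens as
# a rough proxy for concrete detail.
import re
from statistics import mean

def structural_features(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text)
    return {
        "n_sentences": len(sentences),
        "avg_sentence_length": mean(len(re.findall(r"\w+", s)) for s in sentences),
        "n_numbers": sum(t.isdigit() for t in tokens),
        "n_capitalised": sum(t[0].isupper() for t in tokens),
    }

print(structural_features(
    "NASA announced on 1 April that the Moon is made of cheese. "
    "Officials declined to comment."
))
```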

    Lancaster at SemEval-2018 Task 3: Investigating Ironic Features in English Tweets

    This paper describes the system we submitted to SemEval-2018 Task 3. The aim of the system is to distinguish between irony and non-irony in English tweets. We create a targeted feature set and analyse how different features are useful in the task of irony detection, achieving an F1-score of 0.5914. The analysis of individual features provides insight that may be useful in future attempts at detecting irony in tweets.
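
    For reference, the F1-score quoted above is the harmonic mean of precision and recall on the ironic class; the sketch below shows the calculation with made-up prediction counts. The 0.5914 figure is the paper's reported result, not the output of this example.

```python
# F1 is the harmonic mean of precision and recall on the ironic class. The
# counts below are made up for illustration; the paper's reported score on
# the shared-task data is an F1 of 0.5914.

def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(round(f1(tp=120, fp=80, fn=90), 4))   # 0.5854 with these illustrative counts
```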

    Automatically analysing large texts in a GIS environment: The Registrar General’s reports and cholera in the nineteenth century

    This is the peer-reviewed version of the following article: Murrieta-Flores, P., Baron, A., Gregory, I., Hardie, A., & Rayson, P. (2015). Automatically analysing large texts in a GIS environment: The Registrar General’s reports and cholera in the nineteenth century. Transactions in GIS, 19(2), 296-320. DOI: 10.1111/tgis.12106, which has been published in final form at http://onlinelibrary.wiley.com/doi/10.1111/tgis.12106/abstract. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving.

    The aim of this article is to present new research showcasing how Geographic Information Systems in combination with Natural Language Processing and Corpus Linguistics methods can offer innovative avenues of research to analyze large textual collections in the Humanities, particularly in historical research. Using as examples parts of the collection of the Registrar General’s Reports that contain more than 200,000 pages of descriptions, census data and vital statistics for the UK, we introduce newly developed automated textual tools and well-known spatial analyses used in combination to investigate a case study of the references made to cholera and other diseases in these historical sources, and their relationship to place-names during Victorian times. The integration of such techniques has allowed us to explore, in an automatic way, this historical source containing millions of words, to examine the geographies depicted in it, and to identify textual and geographic patterns in the corpus.
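
    The actual pipeline combines geoparsing, corpus tools and GIS, none of which is reproduced here; the toy sketch below only illustrates the underlying idea of collecting candidate place names from sentences that mention a disease term, as a starting point for later mapping.

```python
# A toy illustration (not the project's actual geoparsing/GIS pipeline) of
# the underlying idea: find sentences mentioning a disease term and collect
# capitalised tokens as crude place-name candidates for later mapping.
import re
from collections import Counter

DISEASE_TERMS = {"cholera", "typhus", "smallpox"}   # illustrative term list

def disease_place_mentions(text: str) -> Counter:
    counts = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        if any(w.lower() in DISEASE_TERMS for w in words):
            # crude candidate place names: capitalised, non-initial tokens
            counts.update(w for w in words[1:] if w[0].isupper())
    return counts

sample = ("Deaths from cholera were again recorded in Soho. "
          "The returns for Lancaster show no smallpox this quarter.")
print(disease_place_mentions(sample))   # Counter({'Soho': 1, 'Lancaster': 1})
```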

    The simulated security assessment ecosystem: Does penetration testing need standardisation?

    Simulated security assessments (a collective term used here for penetration testing, vulnerability assessment, and related nomenclature) may need standardisation, but not in the commonly assumed manner of practical assessment methodologies. Instead, this study highlights market failures within the providing industry at the beginning and end of engagements, which have left clients receiving ambiguous and inconsistent services. It is here, at the prior and subsequent phases of practical assessments, that standardisation may serve the continuing professionalisation of the industry, and provide benefits not only to clients but also to the practitioners involved in the provision of these services. These findings are based on the results of 54 stakeholder interviews with providers of services, clients, and coordinating bodies within the industry. The paper culminates with a framework for future advancement of the ecosystem, which includes three recommendations for standardisation.

    Towards Interactive Multidimensional Visualisations for Corpus Linguistics

    We propose the novel application of dynamic and interactive visualisation techniques to support the iterative and exploratory investigations typical of the corpus linguistics methodology. Very large-scale text analysis is already carried out in corpus-based language analysis by employing methods such as frequency profiling, keywords, concordancing, collocations and n-grams. However, at present only basic visualisation methods are utilised. In this paper, we describe case studies of multiple types of key word clouds and explorer tools for collocation networks, and compare network and language distance visualisations for online social networks. These are shown to fit better with the iterative data-driven corpus methodology, and permit some level of scalability to cope with ever-increasing corpus size and complexity. In addition, they will allow corpus linguistic methods to be used more widely in the digital humanities and social sciences, since the learning curve with visualisations is shallower for non-experts.
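
    The visualisations described above are interactive tools; the sketch below only shows the kind of windowed co-occurrence data a collocation network can be built from, using a toy corpus and invented thresholds rather than the tools' actual implementation.

```python
# A minimal sketch of the data behind a collocation network (not the explorer
# tool described above): count co-occurrences within a small token window and
# keep frequent pairs as weighted edges for a network visualisation.
from collections import Counter

def collocation_edges(tokens, window=2, min_count=2):
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return [(a, b, n) for (a, b), n in pairs.items() if n >= min_count]

toy = "corpus linguistics methods corpus linguistics tools corpus analysis".split()
print(collocation_edges(toy))
# [('corpus', 'linguistics', 4), ('corpus', 'methods', 2),
#  ('linguistics', 'methods', 2), ('corpus', 'tools', 2)]
```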